1 Introduction

Welcome to the BiocAsia 2018 workshop on ChIP-seq data analysis. Within this workshop you will learn an analysis pipeline to perform differential analysis with ChIP-seq data peaks between experimental conditions. This document will cover the following material:

  1. Data pre-processing
  2. Summarizing sample reads into peaks regions
  3. Filtering regions with low counts
  4. Normalizing the filtered data
  5. Data exploration, including sample clustering and biological variation
  6. Testing for differential binding between experimental conditions

The aim of this workshop is for you to work through this document at your own pace and ask questions when needed. Some code chunks have been left blank for you to fill in, others already contain the required code and you simply need to run it. Additionally, some code chunks that contain code that is not meant to be run and are there for your information only. In such instances the eval argument has been set to FALSE. Please be sure to work through the document in order, as the majority of the analysis is dependent on previous steps.

2 R packages

The following R packages are needed:

library(RColorBrewer)
library(viridis)
library(limma)
library(edgeR)
# library(matrixStats)
library(Glimma)
library(ggplot2)
library(plotly)
# library(pheatmap)
library(org.Mm.eg.db)

3 Example Data

The example data we will be working with today looks at the binding of the protein Nuclear transcription Factor Y subunit Alpha (NFYA) in mouse samples. There are two groups of ChIP samples, each with two biological replicates:

  • Embryonic Stem cells (ES)
  • Terminally differentiated Neurons (TN)

There is also a single input sample. We are interested in identifying differentially bound peaks between the ES and TN groups.

These data are publicly available through Gene Expression Ominbus, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE25532. These five samples are a small part of a larger dataset, the individual IDs of our samples:

Sample GEO Identifier
ES NFYA 1 GSM632038
ES NFYA 2 GSM632039
TN NFYA 1 GSM632057
TN NFYA 2 GSM632058
Input GSM632041

The raw data (FastQ files) can be downloaded through GEO. All FastQ files are stored as SRA files which need to be transformed back to the FastQ format. The easiest way to do this is using GEO’s SRA toolkit fastq-dump function. This function will download the SRA file for each sample and automatically transform it FastQ format for you. In order to use the function you need the SRA identifier for each sample, given in the table below.

Sample SRA Identifier
ES NFYA 1 SRR074398
ES NFYA 2 SRR074399
TN NFYA 1 SRR074417
TN NFYA 2 SRR074418
Input SRR074401

In this class we already have the required data, so you will not need to download these files. If you wish to repeat this exercise on your own and therefore download the data, the structure of the fastq-dump command is as follows.

fastq-dump SRR074398
fastq-dump SRR074399
fastq-dump SRR074417
fastq-dump SRR074418
fastq-dump SRR074401

Be aware that this function is not an R function. You will need to install the SRA toolkit (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) on your system prior to running this. Each FastQ file will be named using its respective SRA ID.

Now that we have our data, we construct a data frame that summarizes all the information we have about each sample.

SampleInfo <- data.frame(FileName = c("SRR074398.fastq", "SRR074399.fastq", 
    "SRR074417.fastq", "SRR074418.fastq", "SRR074401.fastq"), SampleName = c("es_1", 
    "es_2", "tn_1", "tn_2", "input"), Group = c("es", "es", "tn", "tn", "input"), 
    stringsAsFactors = FALSE)
SampleInfo

4 Data pre-processing

The data you will be working with has already been processed for you as these steps take a few hours to complete. For those wishing to try on there own, the commands for the data pre-processing are provided in this section. Please note not of these commands/functions are run in R. The majority will require installation of external software.

4.1 FastQ file quality control

All FastQ files were run through FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) to check estimated sequence duplication level and for the presence of adapter sequences. FastQC is not an R package and therefore needs to be installed prior to running. The command to run FastQC will all available FastQ files:

fastqc *.fastq

4.2 Sample alignment

All FastQ files were aligned to the mouse genome, build mm10, using bioconductors Rsubread package (http://bioconductor.org/packages/release/bioc/html/Rsubread.html). The first step in this procedure is to build an index for the mouse genome. For this we need the mouse genome reference sequence in fasta/fa format. The sequence can be downloaded form the NCBI (https://www.ncbi.nlm.nih.gov/genome?term=mus%20musculus).

buildindex(basename = "mm10_index", reference = "mm10.fa", gappedIndex = FALSE, 
    indexSplit = FALSE)

Next we align all samples to our index using Rsubreads align command. The subsequent Bam files are given the same name as the FastQ file with addition of ‘.subread.BAM’ extension to the end of the file name.

align(index = "mm10_index", readfile1 = SampleInfo$FileName, type = "DNA", unique = TRUE, 
    nthreads = 10)

On of the most import aspects of the above command is the type specification. The type argument must be set to ‘DNA’ as we are aligning samples composed of DNA as opposed to RNA. If we were aligning samples generated from an RNA-seq experiment, type would be set to ‘RNA’. Note that Rsubread also has another aligner called subjunc. We cannot use this aligner in this instance as it is specific to RNA alignment.

Now that we have aligned our samples, we can check the proportion of reads that were successfully mapped to the genome using Rsubreads propmapped command.

bam_files <- paste(SampleInfo$FileName, ".subread.BAM")
prop.mapped <- propmapped(bam_files)
prop.mapped$Samples <- SampleInfo$FileName
cbind(SampleName = SampleInfo$SampleName, prop.mapped)
  SampleName         Samples NumTotal NumMapped PropMapped
1       es_1 SRR074398.fastq 32038452  22782623   0.711102
2       es_2 SRR074399.fastq 36749276  26119506   0.710749
3       tn_1 SRR074417.fastq 39283051  28983089   0.737801
4       tn_2 SRR074418.fastq 35423633  27282881   0.770189
5      input SRR074401.fastq 15032584  11492495   0.764506

In each case you can see that we have at least 72% of reads mapping to the genome, which is a reasonable amount.

4.3 Bam file processing

Before we can commence our analysis, we first need to coordinate sort the Bam files and mark PCR duplicate reads. There is no software currently available in Bioconductor or R to perform these functions. We therefore employ an external software package Sambamba (http://lomereiter.github.io/sambamba/). The commands to sort the Bam files:

sambamba sort -t 10 SRR074398.fastq.subread.BAM
sambamba sort -t 10 SRR074399.fastq.subread.BAM
sambamba sort -t 10 SRR074417.fastq.subread.BAM
sambamba sort -t 10 SRR074418.fastq.subread.BAM
sambamba sort -t 10 SRR074401.fastq.subread.BAM

In the above command the -t 10 indicates the number of threads/cores you would like to use. The sorted Bam files are saved using the same names as the original Bam files, except with the ‘.BAM’ removed and extension ‘.sorted.bam’ added. Furthermore, and index file is also produced for each sorted Bam file. Index files have the same name as the sorted Bam file, except with the additional extension ‘.bai’.

PCR duplicate reads are now marked. The commands to mark duplicates reads within the Bam files:

sambamba markdup -t 10 SRR074398.fastq.subread.sorted.bam SRR074398.fastq.subread.sorted.markdup.bam
sambamba markdup -t 10 SRR074399.fastq.subread.sorted.bam SRR074399.fastq.subread.sorted.markdup.bam
sambamba markdup -t 10 SRR074417.fastq.subread.sorted.bam SRR074417.fastq.subread.sorted.markdup.bam
sambamba markdup -t 10 SRR074418.fastq.subread.sorted.bam SRR074418.fastq.subread.sorted.markdup.bam
sambamba markdup -t 10 SRR074401.fastq.subread.sorted.bam SRR074401.fastq.subread.sorted.markdup.bam

For the mark duplicate command, we provide the name/location of the sorted Bam file followed by a new name for the sorted, duplicate marked bam file. This process also gives an index file for the new Bam file.

We can now look at the estimated proportion of duplicate reads for each sample. We can do this using Sambamba’s flagstat command. This command summarize all flag information within each Bam file. The flagstat command requires the sorted, duplicate marked Bam file. We the pipe the results from this command using the > symbol into a results .txt file. The commands:

sambamba flagstat -t 10 SRR074398.fastq.subread.sorted.markdup.bam > SRR074398_flagstat.txt
sambamba flagstat -t 10 SRR074399.fastq.subread.sorted.markdup.bam > SRR074399_flagstat.txt
sambamba flagstat -t 10 SRR074417.fastq.subread.sorted.markdup.bam > SRR074417_flagstat.txt
sambamba flagstat -t 10 SRR074418.fastq.subread.sorted.markdup.bam > SRR074418_flagstat.txt
sambamba flagstat -t 10 SRR074401.fastq.subread.sorted.markdup.bam > SRR074401_flagstat.txt

An example of the contents of the results file:

32038452 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
5011150 + 0 duplicates
22782623 + 0 mapped (71.11%:N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A:N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A:N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

The number of duplicate reads is shown on the fourth line of this file, highlighted in bold. To determine the proportion of duplicates reads, we can take this number of divide the by the total number of reads for this sample, given in the first line. The proportion of duplicate reads for each sample:

Sample Name Flagstat file Proportion duplicate reads
es_1 SRR074398_flagstat.txt 0.156
es_2 SRR074399_flagstat.txt 0.242
tn_1 SRR074417_flagstat.txt 0.231
tn_2 SRR074418_flagstat.txt 0.123
input SRR074401_flagstat.txt 0.138

You can see that all samples have a relatively low level of PCR duplicate reads estimated.

4.4 Peak calling

To call peaks for all of our samples we will use the peak caller Homer (http://homer.ucsd.edu/homer/), an non-R software package. Currently R/Bioconductor does not have any peak calling packages available. To call peaks with Homer, there are three steps:

  1. Create tag directories
  2. Find peaks
  3. Annotate peaks

4.4.1 Tag directories

We are going to crate two tag directories - one for the Input sample and one for the ChIP samples. First lets create the directory for the Input sample using Homers makeTagDirectory command.

makeTagDirectory INPUT/ SRR074401.fastq.subread.sorted.markdup.bam  -tbp 1

This command requires the tag directory name, INPUT/, to be given first, followed by the Bam file name/names to be included. The final component of this command, -tbp 1, tells Homer to ignore duplicate reads. The command for the ChIP samples:

makeTagDirectory ES_TN/ SRR074398.fastq.subread.sorted.markdup.bam SRR074399.fastq.subread.sorted.markdup.bam SRR074417.fastq.subread.sorted.markdup.bam SRR074418.fastq.subread.sorted.markdup.bam -tbp 1

You may be wondering why we are creating a tag directory for all ChIP samples rather than a directory for the ES samples and a separate directory for the TN samples. This is because peak callers do not necessarily control error rates correctly. For those interested, please see this article “De novo detection of differentially bound regions for ChIP-seq data using peaks and windows: controlling error rates correctly” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4066778/). For this reason, we place all ChIP samples together prior to calling peaks.

4.4.2 Find peaks

We now apply Homers findPeaks function to identify peaks in our ChIP samples compared to the Input sample.

findPeaks ES_TN/ -style factor -F 0 -L 0 -C 0 -o ES_TN_peaks.txt -i INPUT/

For this command we first give the ChIP sample directory ES_TN/. We then specify the type of ChIP experiment that has been performed. In this instance NFYA is a transcription factor and therefore specify -style factor. If you were looking at histone modification ChIP-seq, you would give -style histone. We then give an output file, -o ES_TN_peaks.txt and finally the input sample directory -i INPUT/. This command outputs lots of summary information to the screen for you to read as the analysis progresses. The most import part of this summary can be found in the last 4 or so lines. The final 4 lines of the for our peak finding command:

0.10% FDR Threshold set at 19.0 (poisson pvalue ~ 8.33e-07)
49835 peaks passed threshold
Total Peaks identified = 49835
Centering peaks of size 111 using a fragment length of 111

From this output you can see the setting used by Homer for this analysis. For this workshop we have turned off the majority of the peak filtering in order to obtain a substantial number of peaks to analyse. The filtering arguments, -F -L -C have all been set to 0. You can also see the total number of peaks that were found is 49,835.

4.4.3 Annotate peaks

All 1675 peaks are now annotated with their nearest transcriptional start site (TSS).

annotatePeaks.pl ES_TN_peaks.txt mm10 > ES_TN_peaks_annotated.txt

For this command we give the output file of findPeaks, the genome we aligned the data to, and then pipe the results to an output .txt file. The top 10 peaks:

peaks <- read.delim("ES_TN_peaks_annotated.txt", stringsAsFactors = FALSE)
colnames(peaks)[1] <- "Peak_ID"
peaks[1:10, ]
            Peak_ID            Chr    Start      End Strand Peak.Score Focus.Ratio.Region.Size Annotation           Detailed.Annotation Distance.to.TSS Nearest.PromoterID Entrez.ID Nearest.Unigene
1            chr9-1           chr9 35305390 35305500      +      310.4                   0.580 Intergenic   GSAT_MM|Satellite|Satellite           24340          NR_040715     71133       Mm.158577
2            chr2-1           chr2 98667046 98667156      +      302.3                   0.572 Intergenic   GSAT_MM|Satellite|Satellite         1199444          NM_178725    241568       Mm.241682
3            chr2-2           chr2 98662798 98662908      +      221.5                   0.640 Intergenic   GSAT_MM|Satellite|Satellite         1195196          NM_178725    241568       Mm.241682
4  chrUn_GL456396-1 chrUn_GL456396    11915    12025      +      182.4                   0.728       <NA>   GSAT_MM|Satellite|Satellite              NA               <NA>        NA                
5           chr14-1          chr14 19417521 19417631      +      160.7                   0.638 Intergenic   GSAT_MM|Satellite|Satellite          185011       NM_001024706    432825       Mm.327147
6           chr14-2          chr14 19417782 19417892      +      148.3                   0.587 Intergenic   GSAT_MM|Satellite|Satellite          184750       NM_001024706    432825       Mm.327147
7           chr14-3          chr14 19416110 19416220      +      122.3                   0.589 Intergenic   GSAT_MM|Satellite|Satellite          186422       NM_001024706    432825       Mm.327147
8  chrUn_GL456389-1 chrUn_GL456389    10611    10721      +      120.7                   0.802       <NA>   GSAT_MM|Satellite|Satellite              NA               <NA>        NA                
9           chr14-4          chr14 19418473 19418583      +      118.0                   0.538 Intergenic   GSAT_MM|Satellite|Satellite          184059       NM_001024706    432825       Mm.327147
10           chr2-3           chr2 98665046 98665156      +      111.1                   0.618 Intergenic SYNREP_MM|Satellite|Satellite         1197444          NM_178725    241568       Mm.241682
   Nearest.Refseq    Nearest.Ensembl     Gene.Name          Gene.Alias                  Gene.Description      Gene.Type
1       NR_040715 ENSMUSG00000111746 4933422A05Rik                   -        RIKEN cDNA 4933422A05 gene          ncRNA
2       NM_178725 ENSMUSG00000050587        Lrrc4c 6430556C10Rik|NGL-1 leucine rich repeat containing 4C protein-coding
3       NM_178725 ENSMUSG00000050587        Lrrc4c 6430556C10Rik|NGL-1 leucine rich repeat containing 4C protein-coding
4                                                                                                                      
5    NM_001024706 ENSMUSG00000095024        Gm5458            EG432825               predicted gene 5458 protein-coding
6    NM_001024706 ENSMUSG00000095024        Gm5458            EG432825               predicted gene 5458 protein-coding
7    NM_001024706 ENSMUSG00000095024        Gm5458            EG432825               predicted gene 5458 protein-coding
8                                                                                                                      
9    NM_001024706 ENSMUSG00000095024        Gm5458            EG432825               predicted gene 5458 protein-coding
10      NM_178725 ENSMUSG00000050587        Lrrc4c 6430556C10Rik|NGL-1 leucine rich repeat containing 4C protein-coding

4.5 Read summarization

Now that we have our peak regions, we need to summarize all reads for each ChIP sample that overlap these regions. We do not need to do this for the input sample as it is not of use for the differential analysis.

To summarize reads into counts over the peak regions, we will use Rsubreads featureCounts function. This function utilizes the Bam files and peak region information. First we format our peak region data frame so that it is usable by featureCounts. For featureCounts, the peak region data frame needs to be subset to the columns that contain the Peak ID, chromosome, start, end and strand columns.

peaks_FC <- peaks[, c("Peak_ID", "Chr", "Start", "End", "Strand")]

Furthermore, the column name of the Peak ID column needs to be changed to “GeneID”.

colnames(peaks_FC)[1] <- "GeneID"

Finally the Strand column needs to be changed so that all entries are the “*" symbol as strand was taken into account during the peak calling process.

peaks_FC$Strand <- "*"

Lets print out the top 6 rows of the formatted peak data frame:

head(peaks_FC)
            GeneID            Chr    Start      End Strand
1           chr9-1           chr9 35305390 35305500      *
2           chr2-1           chr2 98667046 98667156      *
3           chr2-2           chr2 98662798 98662908      *
4 chrUn_GL456396-1 chrUn_GL456396    11915    12025      *
5          chr14-1          chr14 19417521 19417631      *
6          chr14-2          chr14 19417782 19417892      *

We are now ready to summarize are reads into counts. The results are saved in the RData object ‘ChIP_workshop_PeakCounts.RData’.

SampleInfo <- SampleInfo[-5, ]
bam_files <- paste0(SampleInfo$FileName, ".subread.BAM")
FC <- featureCounts(files = bam_files, annot.ext = peaks_FC, useMetaFeatures = FALSE, 
    countMultiMappingReads = FALSE)
colnames(FC$counts) <- SampleInfo$SampleName
colnames(FC$stat)[-1] <- SampleInfo$SampleName
save(FC, file = "ChIP_workshop_PeakCounts.RData")

5 Raw data

We are now at the point where you will begin the practical component of this workshop. Please ask for help at any time, and remember to consult the function help files.

Lets load the peak information and the featureCounts output, and remove the input sample from the SampleInfo data frame created at the beginning of the document.

peaks <- read.delim("ES_TN_peaks_annotated.txt", stringsAsFactors = FALSE)
colnames(peaks)[1] <- "Peak_ID"
load("ChIP_workshop_PeakCounts.RData")
SampleInfo <- SampleInfo[-5, ]

To ensure the peak information loaded correctly and to see the structure of this data frame, lets print the first 6 lines of this object below.

head(peaks)

The results from featureCounts stored in the FC object are given as a list containing

  • a counts matrix named counts
  • annotation information named annotation
  • sample information named targets
  • mapping statistics names stat

Lets have a look at the mapping statistics:

FC$stat

From these results you can see that the majority of reads in each sample were not assigned to a region as there were no features/peak region in their location. This is unsurprising as transcription factor peaks are typically quite narrow and will therefore only take up a small portion of the genome.

Now lets extract the counts matrix within the featureCounts list and save it to a variable called counts. We will then have a look at the first 10 rows of the counts matrix.

counts <- FC$counts
counts[1:10, ]
                  es_1   es_2  tn_1  tn_2
chr9-1           31039  29682  7080  6978
chr2-1           99119 109979 29737 29499
chr2-2           50669  57485 15399 13601
chrUn_GL456396-1 20431  25108  5267  4351
chr14-1           7434   8664  2327  1855
chr14-2          11940  11295  2771  2244
chr14-3           4800   5597  1587  1352
chrUn_GL456389-1 23246  24868  5988  5897
chr14-4           2052   2396   539   525
chr2-3            4068   4751  2175  1913

Now that we have the count data, we can begin the differential analysis.

6 Digitial gene expression list

The first step in the differential analysis is to build Digital Gene Expression (DGE) list. We will use this list throughout the remainder of the analysis as it has been designed for use with the limma and edgeR packages. We create this object using the command DGEList. This command requires the following items:

  • a counts matrix, entered using the counts parameter
  • sample group information (found in the SampleInfo data frame), entered using the group parameter
  • peak region information. The peak region information, within the peaks data frame, is entered using the gene parameter within the DGEList command

Have a go at creating this a DGEList, storing it in a variable called DGE, and printing it below. Please remember to ask for help if you are having trouble.

DGE <- DGEList(counts = counts, group = SampleInfo$Group, genes = peaks)
DGE
An object of class "DGEList"
$counts
                  es_1   es_2  tn_1  tn_2
chr9-1           31039  29682  7080  6978
chr2-1           99119 109979 29737 29499
chr2-2           50669  57485 15399 13601
chrUn_GL456396-1 20431  25108  5267  4351
chr14-1           7434   8664  2327  1855
49830 more rows ...

$samples
     group lib.size norm.factors
es_1    es   647471            1
es_2    es   821618            1
tn_1    tn   603555            1
tn_2    tn   511832            1

$genes
                          Peak_ID            Chr    Start      End Strand
chr9-1                     chr9-1           chr9 35305390 35305500      +
chr2-1                     chr2-1           chr2 98667046 98667156      +
chr2-2                     chr2-2           chr2 98662798 98662908      +
chrUn_GL456396-1 chrUn_GL456396-1 chrUn_GL456396    11915    12025      +
chr14-1                   chr14-1          chr14 19417521 19417631      +
                 Peak.Score Focus.Ratio.Region.Size Annotation
chr9-1                310.4                   0.580 Intergenic
chr2-1                302.3                   0.572 Intergenic
chr2-2                221.5                   0.640 Intergenic
chrUn_GL456396-1      182.4                   0.728       <NA>
chr14-1               160.7                   0.638 Intergenic
                         Detailed.Annotation Distance.to.TSS
chr9-1           GSAT_MM|Satellite|Satellite           24340
chr2-1           GSAT_MM|Satellite|Satellite         1199444
chr2-2           GSAT_MM|Satellite|Satellite         1195196
chrUn_GL456396-1 GSAT_MM|Satellite|Satellite              NA
chr14-1          GSAT_MM|Satellite|Satellite          185011
                 Nearest.PromoterID Entrez.ID Nearest.Unigene
chr9-1                    NR_040715     71133       Mm.158577
chr2-1                    NM_178725    241568       Mm.241682
chr2-2                    NM_178725    241568       Mm.241682
chrUn_GL456396-1               <NA>        NA                
chr14-1                NM_001024706    432825       Mm.327147
                 Nearest.Refseq    Nearest.Ensembl     Gene.Name
chr9-1                NR_040715 ENSMUSG00000111746 4933422A05Rik
chr2-1                NM_178725 ENSMUSG00000050587        Lrrc4c
chr2-2                NM_178725 ENSMUSG00000050587        Lrrc4c
chrUn_GL456396-1                                                
chr14-1            NM_001024706 ENSMUSG00000095024        Gm5458
                          Gene.Alias                  Gene.Description
chr9-1                             -        RIKEN cDNA 4933422A05 gene
chr2-1           6430556C10Rik|NGL-1 leucine rich repeat containing 4C
chr2-2           6430556C10Rik|NGL-1 leucine rich repeat containing 4C
chrUn_GL456396-1                                                      
chr14-1                     EG432825               predicted gene 5458
                      Gene.Type
chr9-1                    ncRNA
chr2-1           protein-coding
chr2-2           protein-coding
chrUn_GL456396-1               
chr14-1          protein-coding
49830 more rows ...

7 Filtering

To reduce noise in the data we need to filter peak regions with low counts. For this we will use edgeR’s filterByExpr function. To apply this function you will need the DGE object and the sample group information, entered into the filterByExpr command using the group option. Have a go at applying this function and storing the results in the variable is.exprs.

is.exprs <- filterByExpr(DGE, group = SampleInfo$Group)

To see how many regions are deemed to have counts at a reasonable level, table the is.exprs object.

table(is.exprs)
is.exprs
FALSE  TRUE 
40709  9126 

We can now filter the DGE list.

DGE <- DGE[is.exprs, , keep.lib.sizes = FALSE]

You’ll notice that I have included an additional option when filtering the DGE list - keep.lib.sizes=FALSE. This option is included so the library sizes of our samples (total number of reads in a sample) is re-calculated following filtering. After filtering you should have 9126 peak regions (rows) remaining. You can check this using the nrow(DGE) command below.

nrow(DGE)
[1] 9126

8 Normalisation

We apply TMM normalization to the filtered data in order to account for difference in library size and sample composition. This function calculates a scaling factor for each sample that, ideally, should be around/close to 1. You can apply TMM normalization using the calcNormFactors function with the method set to TMM. Apply this to you DGE list, remembering to re-save the list back to DGE. Then print out the normalization factors which can be found in the samples section of the DGE object.

DGE <- calcNormFactors(DGE, method = "TMM")
DGE$samples

Your normalization factors should range from approximately 0.467 to 1.952.

9 Data exploration

9.1 Sample clustering

Now that we have filtered and normalized the data, we can begin to explore it. We start by looking at the sample clustering with a Multi-Dimension Scaling (MDS) plot. The MDS plot is created in an unsupervised manor such that each sample is plotted according to the distance between it and all other samples. Therefore samples that sit closer together are more similar, while those that are further apart are more different. You can created an MDS plot of the data using the plotMDS function. By default the distance is calculated for each pair of samples based on the top 500 most variable regions between that pair. Create an MDS plot of the data, try coloring the points according to their group.

plotMDS(DGE, col = c(rep("green", 2), rep("orange", 2)))

On the MDS plot you should see the samples being separated by group - ES and TN; in the first dimension. The second dimension should show a separation of the ES samples, while the TN samples sit close together. Ideally we would like replicate samples to cluster closely together, while the groups are far apart. We have mostly achieved this with the NFYA data, although the ES samples are further apart than we would typically like. This indicates that there is some variance between these samples, which we will explore later.

9.2 Design matrix

We now need to construct the design matrix to be used for the analysis. This matrix mathematically represents the design of the data set, typically specifying which samples belong in each group; and is used to form the linear models that will use to test for differential binding between the ES and TN groups. This matrix is created and printed below.

design <- model.matrix(~0 + DGE$samples$group)
colnames(design) <- c("es", "tn")
row.names(design) <- colnames(DGE)
design
     es tn
es_1  1  0
es_2  1  0
tn_1  0  1
tn_2  0  1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$`DGE$samples$group`
[1] "contr.treatment"

9.3 Biological variation

We now explore the biological variation between replicate samples. This variation is estimated using the estimateDisp function, which requires the DGE list and design matrix. Have a go at applying this function, saving the results in an object called disp. When running estimateDisp, be sure to set the robust argument to TRUE as this will protect against extreme/outlying observations. We apply robust due to the variance seen within the ES group in the MDS plot. Print the disp object once you have created it.

disp <- estimateDisp(y = DGE, design = design, robust = TRUE)
disp
An object of class "DGEList"
$counts
                  es_1   es_2  tn_1  tn_2
chr9-1           31039  29682  7080  6978
chr2-1           99119 109979 29737 29499
chr2-2           50669  57485 15399 13601
chrUn_GL456396-1 20431  25108  5267  4351
chr14-1           7434   8664  2327  1855
9121 more rows ...

$samples
     group lib.size norm.factors
es_1    es   477699    0.4667213
es_2    es   538050    0.5761282
tn_1    tn   337503    1.9518600
tn_2    tn   283950    1.9053490

$genes
                          Peak_ID            Chr    Start      End Strand
chr9-1                     chr9-1           chr9 35305390 35305500      +
chr2-1                     chr2-1           chr2 98667046 98667156      +
chr2-2                     chr2-2           chr2 98662798 98662908      +
chrUn_GL456396-1 chrUn_GL456396-1 chrUn_GL456396    11915    12025      +
chr14-1                   chr14-1          chr14 19417521 19417631      +
                 Peak.Score Focus.Ratio.Region.Size Annotation
chr9-1                310.4                   0.580 Intergenic
chr2-1                302.3                   0.572 Intergenic
chr2-2                221.5                   0.640 Intergenic
chrUn_GL456396-1      182.4                   0.728       <NA>
chr14-1               160.7                   0.638 Intergenic
                         Detailed.Annotation Distance.to.TSS
chr9-1           GSAT_MM|Satellite|Satellite           24340
chr2-1           GSAT_MM|Satellite|Satellite         1199444
chr2-2           GSAT_MM|Satellite|Satellite         1195196
chrUn_GL456396-1 GSAT_MM|Satellite|Satellite              NA
chr14-1          GSAT_MM|Satellite|Satellite          185011
                 Nearest.PromoterID Entrez.ID Nearest.Unigene
chr9-1                    NR_040715     71133       Mm.158577
chr2-1                    NM_178725    241568       Mm.241682
chr2-2                    NM_178725    241568       Mm.241682
chrUn_GL456396-1               <NA>        NA                
chr14-1                NM_001024706    432825       Mm.327147
                 Nearest.Refseq    Nearest.Ensembl     Gene.Name
chr9-1                NR_040715 ENSMUSG00000111746 4933422A05Rik
chr2-1                NM_178725 ENSMUSG00000050587        Lrrc4c
chr2-2                NM_178725 ENSMUSG00000050587        Lrrc4c
chrUn_GL456396-1                                                
chr14-1            NM_001024706 ENSMUSG00000095024        Gm5458
                          Gene.Alias                  Gene.Description
chr9-1                             -        RIKEN cDNA 4933422A05 gene
chr2-1           6430556C10Rik|NGL-1 leucine rich repeat containing 4C
chr2-2           6430556C10Rik|NGL-1 leucine rich repeat containing 4C
chrUn_GL456396-1                                                      
chr14-1                     EG432825               predicted gene 5458
                      Gene.Type
chr9-1                    ncRNA
chr2-1           protein-coding
chr2-2           protein-coding
chrUn_GL456396-1               
chr14-1          protein-coding
9121 more rows ...

$design
     es tn
es_1  1  0
es_2  1  0
tn_1  0  1
tn_2  0  1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$`DGE$samples$group`
[1] "contr.treatment"


$common.dispersion
[1] 0.06764929

$trended.dispersion
[1] 0.04895039 0.04776403 0.04838952 0.04926345 0.05041929
9121 more elements ...

$tagwise.dispersion
[1] 0.04724277 0.04098935 0.03877217 0.03710657 0.03886940
9121 more elements ...

$AveLogCPM
[1] 15.98021 17.77794 16.81498 15.52523 14.05787
9121 more elements ...

$trend.method
[1] "locfit"

$prior.df
[1] 8.160687 8.220096 8.284002 8.323355 8.323355
9121 more elements ...

$prior.n
[1] 4.080343 4.110048 4.142001 4.161677 4.161677
9121 more elements ...

$span
[1] 0.3055144

Lets now have a look at the biological coefficient of variation (BCV) for the data set. The BCV indicates how variable replicate samples are. Ideally we want this value to be small, generally below 0.1 for mouse data. To calculated the BCV for the data, take the square root of the common dispersion value within the disp object.

BCV <- sqrt(disp$common.dispersion)
BCV
[1] 0.2600948

You should have a BCV value around 0.26, which is slightly high. This is most likely caused by the variance we saw between the ES replicate samples on the MDS plot. You can also calculate a BCV for each peak region using the square root of the tagwise dispersion values in the disp object. Lets look at a summary of BCVs for the peak regions to give us an idea of how variable our data is.

summary(sqrt(disp$tagwise.dispersion))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.009882 0.219719 0.256114 0.275484 0.307188 0.997148 

We can also create a plot of the replicate sample variability using the plotBCV function on the disp object. This plot is known as a BCV plot. Create such a plot now.

plotBCV(disp)

This plot shows the BCV for each peak region (y-axis) plotted against their average \(\log_2\)-counts per million reads (CPM). The common BCV is indicated by the red line, while the trend in the data is shown by the blue line. An interactive version of this plot can be created using the following commands:

glXYPlot(x = disp$AveLogCPM, y = sqrt(disp$tagwise.dispersion), counts = DGE$counts, 
    transform = TRUE, groups = DGE$samples$group, samples = colnames(DGE), anno = DGE$genes, 
    xlab = "Average log CPM", ylab = "Biological coefficient of variation", 
    side.main = "Peak_ID", html = "Interactive_BCVplot")

With this plot you can explore the change in counts for the peak regions and how this affects the BCV plot. If this plot does not appear automatically, go the “glimma-plots” folder and open it direction from there.

The variance displayed by these data are somewhat unusual. You can see that there are some regions where the within group variation is very high, while others where it is almost completely gone. However this tends to occur in regions with relatively low counts. Regions that show higher counts show a much more stable variance.

10 Differential binding analysis

We are now ready to test for differential binding between the ES and TN groups. For this analysis we will apply edgeR’s Quasi-likelihood pipeline as it should handle the unusual variance displayed by these data better than the limma analysis pipelines.

First we need to apply the glmQLFit function to the data, which will fit a quasi-likelihood negative binomial generalized log-linear model to the data. It also conducts empirical Bayes moderation of the region specific quasi-likelihood dispersion’s, shrinking them towards the data trend. For this function you need to provide the disp object so the variance estimates are included in the analysis, as well as the design matrix. Furthermore, as we applied the robust parameter when estimating the biological variation , we also need to apply this parameter here. Have a go at fitting the quasi-likelihood model now, storing in a variable called fit. Please remember to ask questions if you are unsure.

fit <- glmQLFit(y = disp, design = design, robust = TRUE)

Now that we have modeled the data, we can visualize the distribution of the quasi-likelihood dispersion’s after empirical bayes shrinkage using the plotQLDisp function. Apply this function to the modeled data.

plotQLDisp(glmfit = fit)

The plot displays the quarter-root of the quasi-likelihood dispersion’s for all peak regions before and after shrinkage (red points).

We now need to specify the comparison we would like to make with the data using a contrast matrix. Recall that we are interested in determining which regions are different between the ES and TN groups. We give this in a contrasts matrix using the command below.

contrast_matrix <- makeContrasts(ES_vs_TN = es - tn, levels = design)

The contrasts matrix:

contrast_matrix
      Contrasts
Levels ES_vs_TN
    es        1
    tn       -1

You can see that are subtracting the TN samples from the ES samples. We now test for differential binding using the glmQLFTest function. To apply this function, you need to supply the fit object together with the contrasts matrix. Save the results in an object called QLFtest.

QLFtest <- glmQLFTest(glmfit = fit, contrast = contrast_matrix)

For this analysis we will set the false discovery rate (FDR) to 0.05. Therefore for a peak region to be differentially bound (DB) its FDR must be less than 0.05. We can determine which regions pass this criteria using the decideTests function. Apply this function now, saving the results in a variable called DT, and print out the first 100 items of the results.

DT <- decideTests(QLFtest)
DT[1:100, ]
           chr9-1            chr2-1            chr2-2  chrUn_GL456396-1 
                1                 1                 1                 1 
          chr14-1           chr14-2           chr14-3  chrUn_GL456389-1 
                1                 1                 1                 1 
          chr14-4            chr2-3            chr4-1           chr14-5 
                1                 1                 1                 1 
 chrUn_GL456392-1           chr14-7            chr2-4           chr14-6 
                1                 1                 1                 1 
          chr14-8            chr2-5           chr14-9          chr14-10 
                1                 1                 1                 1 
 chrUn_GL456392-2            chr9-2  chrUn_JH584304-1            chrX-1 
                1                 1                 1                 1 
 chrUn_GL456396-2            chr6-1  chrUn_GL456396-3            chr2-6 
                1                 1                 1                 1 
 chrUn_JH584304-2  chrUn_GL456392-3            chr9-3           chr12-1 
                1                 1                 1                 1 
 chrUn_JH584304-3  chrUn_GL456390-1            chr9-4  chrUn_JH584304-4 
                1                 1                 1                 1 
           chr4-2            chr9-5  chrUn_GL456389-2  chrUn_GL456390-2 
                1                 1                 1                 1 
 chrUn_GL456390-3            chr8-1            chr9-6  chrUn_JH584304-5 
                1                 1                 1                 1 
 chrUn_GL456392-4            chr9-8  chrUn_GL456396-4  chrUn_GL456390-4 
                1                 1                 1                 1 
           chr9-9  chrUn_GL456392-5            chr9-7           chr9-10 
                1                 1                 1                 1 
 chrUn_GL456390-5  chrUn_GL456392-6           chr18-1  chrUn_GL456393-1 
                1                 1                 1                 1 
 chrUn_GL456392-7  chrUn_GL456383-2  chrUn_GL456390-6           chr9-11 
                1                 1                 1                 1 
          chr12-2           chr19-1  chrUn_JH584304-6  chrUn_GL456383-1 
                1                 1                 1                 1 
          chr13-1            chr4-3           chr18-2  chrUn_GL456392-8 
                1                 0                 0                 1 
           chr5-1  chrUn_GL456390-8            chr8-2  chrUn_GL456392-9 
                0                 1                 0                 1 
 chrUn_GL456390-9  chrUn_GL456383-3           chr9-12           chr9-13 
                1                 1                 1                 1 
           chr2-7            chr8-3 chrUn_GL456392-10  chrUn_GL456383-4 
                0                 0                 1                 1 
 chrUn_GL456383-5           chr9-14            chr8-4           chr9-15 
                1                 0                 0                 0 
          chr11-1  chrUn_GL456390-7           chr16-1  chrUn_JH584304-7 
                1                 1                 1                 1 
           chr2-8            chr1-1 chrUn_GL456390-10 chrUn_GL456392-11 
                0                 0                 1                 1 
 chrUn_GL456396-6  chrUn_GL456383-6          chr14-11           chr15-1 
                1                 1                 0                 1 
           chr5-2  chrUn_GL456396-5           chr11-2  chrUn_GL456396-7 
                0                 1                 0                 1 

You can see that every peak region has been assigned a number - either 1, 0 or -1. Those regions with a 1 are up-regulated in ES compared to TN, those with a -1 are down-regulated in ES compared to TN, and those with a 0 are not significant. A summary of the results can be generated by applying the summary command to DT.

summary(DT)
       1*es -1*tn
Down          264
NotSig       7803
Up           1059

You should see that, in total, you have 264 regions that are down-regulated and 1059 regions that are up-regulated. The direction of regulation is determined by the \(\log_2\)-fold change for that region. Those regions with a positive \(\log_2\)-fold change are up-regulated, while those with a negative \(\log_2\)-fold change are down-regulated.

Lets have a look at the top results for our analysis. We can do this using the topTags function. By default topTags shows the top 10 results. You can change this by adjusting the parameter n. If you would like to see results for all regions, you can set n=Inf. Display the top 20 results for our analysis.

topTags(QLFtest, n = 20)
Coefficient:  1*es -1*tn 
                          Peak_ID            Chr    Start      End Strand
chrUn_GL456396-1 chrUn_GL456396-1 chrUn_GL456396    11915    12025      +
chr2-2                     chr2-2           chr2 98662798 98662908      +
chr14-6                   chr14-6          chr14 19419264 19419374      +
chr2-1                     chr2-1           chr2 98667046 98667156      +
chr14-1                   chr14-1          chr14 19417521 19417631      +
chrUn_GL456389-1 chrUn_GL456389-1 chrUn_GL456389    10611    10721      +
chr14-2                   chr14-2          chr14 19417782 19417892      +
chr9-1                     chr9-1           chr9 35305390 35305500      +
chr4-1                     chr4-1           chr4 70378129 70378239      +
chrUn_GL456392-1 chrUn_GL456392-1 chrUn_GL456392    23146    23256      +
chrUn_JH584304-5 chrUn_JH584304-5 chrUn_JH584304   103497   103607      +
chr9-12                   chr9-12           chr9  3027716  3027826      +
chr10-12                 chr10-12          chr10 55215106 55215216      +
chr9-7                     chr9-7           chr9  3004469  3004579      +
chr14-3                   chr14-3          chr14 19416110 19416220      +
chrUn_GL456392-2 chrUn_GL456392-2 chrUn_GL456392    22528    22638      +
chrUn_GL456389-4 chrUn_GL456389-4 chrUn_GL456389    17134    17244      +
chr14-4                   chr14-4          chr14 19418473 19418583      +
chrUn_JH584304-3 chrUn_JH584304-3 chrUn_JH584304    98068    98178      +
chr14-5                   chr14-5          chr14 19419626 19419736      +
                 Peak.Score Focus.Ratio.Region.Size
chrUn_GL456396-1      182.4                   0.728
chr2-2                221.5                   0.640
chr14-6                98.1                   0.506
chr2-1                302.3                   0.572
chr14-1               160.7                   0.638
chrUn_GL456389-1      120.7                   0.802
chr14-2               148.3                   0.587
chr9-1                310.4                   0.580
chr4-1                109.4                   0.767
chrUn_GL456392-1      100.8                   0.766
chrUn_JH584304-5       46.4                   0.975
chr9-12                33.5                   0.973
chr10-12               17.8                   0.970
chr9-7                 39.8                   0.966
chr14-3               122.3                   0.589
chrUn_GL456392-2       79.5                   0.608
chrUn_GL456389-4       28.4                   0.754
chr14-4               118.0                   0.538
chrUn_JH584304-3       58.9                   0.801
chr14-5               105.9                   0.696
                                         Annotation
chrUn_GL456396-1                               <NA>
chr2-2                                   Intergenic
chr14-6                                  Intergenic
chr2-1                                   Intergenic
chr14-1                                  Intergenic
chrUn_GL456389-1                               <NA>
chr14-2                                  Intergenic
chr9-1                                   Intergenic
chr4-1           intron (NM_145990, intron 4 of 36)
chrUn_GL456392-1                               <NA>
chrUn_JH584304-5                         Intergenic
chr9-12                                  Intergenic
chr10-12                                 Intergenic
chr9-7                                   Intergenic
chr14-3                                  Intergenic
chrUn_GL456392-2                               <NA>
chrUn_GL456389-4                               <NA>
chr14-4                                  Intergenic
chrUn_JH584304-3                         Intergenic
chr14-5                                  Intergenic
                         Detailed.Annotation Distance.to.TSS
chrUn_GL456396-1 GSAT_MM|Satellite|Satellite              NA
chr2-2           GSAT_MM|Satellite|Satellite         1195196
chr14-6          GSAT_MM|Satellite|Satellite          183268
chr2-1           GSAT_MM|Satellite|Satellite         1199444
chr14-1          GSAT_MM|Satellite|Satellite          185011
chrUn_GL456389-1 GSAT_MM|Satellite|Satellite              NA
chr14-2          GSAT_MM|Satellite|Satellite          184750
chr9-1           GSAT_MM|Satellite|Satellite           24340
chr4-1           GSAT_MM|Satellite|Satellite           32251
chrUn_GL456392-1 GSAT_MM|Satellite|Satellite              NA
chrUn_JH584304-5 GSAT_MM|Satellite|Satellite          -43863
chr9-12          GSAT_MM|Satellite|Satellite           10972
chr10-12                          Intergenic         -891756
chr9-7           GSAT_MM|Satellite|Satellite           34219
chr14-3          GSAT_MM|Satellite|Satellite          186422
chrUn_GL456392-2 GSAT_MM|Satellite|Satellite              NA
chrUn_GL456389-4 GSAT_MM|Satellite|Satellite              NA
chr14-4          GSAT_MM|Satellite|Satellite          184059
chrUn_JH584304-3 GSAT_MM|Satellite|Satellite          -38434
chr14-5          GSAT_MM|Satellite|Satellite          182906
                 Nearest.PromoterID Entrez.ID Nearest.Unigene
chrUn_GL456396-1               <NA>        NA                
chr2-2                    NM_178725    241568       Mm.241682
chr14-6                NM_001024706    432825       Mm.327147
chr2-1                    NM_178725    241568       Mm.241682
chr14-1                NM_001024706    432825       Mm.327147
chrUn_GL456389-1               <NA>        NA                
chr14-2                NM_001024706    432825       Mm.327147
chr9-1                    NR_040715     71133       Mm.158577
chr4-1                 NM_001313762    214444       Mm.370777
chrUn_GL456392-1               <NA>        NA                
chrUn_JH584304-5          NR_003518     66776       Mm.247625
chr9-12                   NR_039546 100628572                
chr10-12               NM_001163833     73390       Mm.181475
chr9-7                    NR_039546 100628572                
chr14-3                NM_001024706    432825       Mm.327147
chrUn_GL456392-2               <NA>        NA                
chrUn_GL456389-4               <NA>        NA                
chr14-4                NM_001024706    432825       Mm.327147
chrUn_JH584304-3          NR_003518     66776       Mm.247625
chr14-5                NM_001024706    432825       Mm.327147
                 Nearest.Refseq    Nearest.Ensembl     Gene.Name
chrUn_GL456396-1                                                
chr2-2                NM_178725 ENSMUSG00000050587        Lrrc4c
chr14-6            NM_001024706 ENSMUSG00000095024        Gm5458
chr2-1                NM_178725 ENSMUSG00000050587        Lrrc4c
chr14-1            NM_001024706 ENSMUSG00000095024        Gm5458
chrUn_GL456389-1                                                
chr14-2            NM_001024706 ENSMUSG00000095024        Gm5458
chr9-1                NR_040715 ENSMUSG00000111746 4933422A05Rik
chr4-1                NM_145990 ENSMUSG00000039298      Cdk5rap2
chrUn_GL456392-1                                                
chrUn_JH584304-5      NR_003518                         Pisd-ps3
chr9-12               NR_039546 ENSMUSG00000106183       Mir101c
chr10-12              NM_133751 ENSMUSG00000047669        Msl3l2
chr9-7                NR_039546 ENSMUSG00000106183       Mir101c
chr14-3            NM_001024706 ENSMUSG00000095024        Gm5458
chrUn_GL456392-2                                                
chrUn_GL456389-4                                                
chr14-4            NM_001024706 ENSMUSG00000095024        Gm5458
chrUn_JH584304-3      NR_003518                         Pisd-ps3
chr14-5            NM_001024706 ENSMUSG00000095024        Gm5458
                                 Gene.Alias
chrUn_GL456396-1                           
chr2-2                  6430556C10Rik|NGL-1
chr14-6                            EG432825
chr2-1                  6430556C10Rik|NGL-1
chr14-1                            EG432825
chrUn_GL456389-1                           
chr14-2                            EG432825
chr9-1                                    -
chr4-1           2900018K03Rik|an|mKIAA1633
chrUn_GL456392-1                           
chrUn_JH584304-5              4933439C20Rik
chr9-12               mir-101c|mmu-mir-101c
chr10-12                      1700060H10Rik
chr9-7                mir-101c|mmu-mir-101c
chr14-3                            EG432825
chrUn_GL456392-2                           
chrUn_GL456389-4                           
chr14-4                            EG432825
chrUn_JH584304-3              4933439C20Rik
chr14-5                            EG432825
                                               Gene.Description
chrUn_GL456396-1                                               
chr2-2                        leucine rich repeat containing 4C
chr14-6                                     predicted gene 5458
chr2-1                        leucine rich repeat containing 4C
chr14-1                                     predicted gene 5458
chrUn_GL456389-1                                               
chr14-2                                     predicted gene 5458
chr9-1                               RIKEN cDNA 4933422A05 gene
chr4-1             CDK5 regulatory subunit associated protein 2
chrUn_GL456392-1                                               
chrUn_JH584304-5 phosphatidylserine decarboxylase, pseudogene 3
chr9-12                                           microRNA 101c
chr10-12             male-specific lethal 3-like 2 (Drosophila)
chr9-7                                            microRNA 101c
chr14-3                                     predicted gene 5458
chrUn_GL456392-2                                               
chrUn_GL456389-4                                               
chr14-4                                     predicted gene 5458
chrUn_JH584304-3 phosphatidylserine decarboxylase, pseudogene 3
chr14-5                                     predicted gene 5458
                      Gene.Type    logFC    logCPM        F       PValue
chrUn_GL456396-1                3.428151 15.525234 198.8455 4.457232e-08
chr2-2           protein-coding 3.088596 16.814980 170.1842 9.955614e-08
chr14-6          protein-coding 3.719418 11.942550 157.1406 1.474175e-07
chr2-1           protein-coding 3.003682 17.777936 158.4235 1.500904e-07
chr14-1          protein-coding 3.138163 14.057867 149.1762 1.818572e-07
chrUn_GL456389-1                3.206187 15.641300 151.7499 1.846878e-07
chr14-2          protein-coding 3.429055 14.584950 148.1514 2.192914e-07
chr9-1                    ncRNA 3.312836 15.980215 144.9162 2.436875e-07
chr4-1           protein-coding 3.484073 12.524366 142.6051 2.489666e-07
chrUn_GL456392-1                3.483618 12.229255 141.8879 2.550596e-07
chrUn_JH584304-5         pseudo 3.452874 12.002358 139.0015 2.682134e-07
chr9-12                   ncRNA 3.449140 10.912090 135.2863 2.918558e-07
chr10-12         protein-coding 4.016474  9.738283 136.9667 3.020388e-07
chr9-7                    ncRNA 3.867770 10.754689 136.5393 3.065885e-07
chr14-3          protein-coding 3.011539 13.440163 132.2577 3.255203e-07
chrUn_GL456392-2                3.476863 11.768739 134.5556 3.288074e-07
chrUn_GL456389-4                3.559722 10.437111 130.0210 3.533834e-07
chr14-4          protein-coding 3.242780 12.189630 129.9421 3.618368e-07
chrUn_JH584304-3         pseudo 3.523108 10.933516 129.6937 3.740972e-07
chr14-5          protein-coding 3.203246 13.301102 128.6911 4.067129e-07
                          FDR
chrUn_GL456396-1 0.0001646722
chr2-2           0.0001646722
chr14-6          0.0001646722
chr2-1           0.0001646722
chr14-1          0.0001646722
chrUn_GL456389-1 0.0001646722
chr14-2          0.0001646722
chr9-1           0.0001646722
chr4-1           0.0001646722
chrUn_GL456392-1 0.0001646722
chrUn_JH584304-5 0.0001646722
chr9-12          0.0001646722
chr10-12         0.0001646722
chr9-7           0.0001646722
chr14-3          0.0001646722
chrUn_GL456392-2 0.0001646722
chrUn_GL456389-4 0.0001646722
chr14-4          0.0001646722
chrUn_JH584304-3 0.0001646722
chr14-5          0.0001646722

The table above should give the peak information, we well as the following statistics

Statistic Description
logFC region \(\log_2\)-fold change
logCPM region average \(\log_2\)-CPM
F Quasi-likelihood F-statistic used for DB testing
PValue region raw p-value
FDR region false discovery rate

We can write the results to a file using the following command:

results <- topTags(QLFtest, n = Inf)
write.csv(results$table, file = "Results.csv", row.names = FALSE)

To visualize our results we can use a mean-difference plot. The mean-difference graph plots the region average \(\log_2\)-CPM versus its logFC. All regions that were significant are colored according to their direction of regulation - red and blue for up and down respectively. This plot is created using the plotMD function, utilizing the QLFtest object. Create the mean-difference plot.

plotMD(QLFtest)

An interactive version of this plot can be created using glMDPlot command. Have a go at creating this plot, using the command for the interactive BCV plot as a guide and consulting the help file.

glMDPlot(x = QLFtest, counts = DGE$counts, transform = TRUE, anno = DGE$genes, 
    groups = DGE$samples$group, status = DT, samples = colnames(DGE), side.main = "Peak_ID", 
    html = "Interactive_MDplot")

For the final part the analysis, lets create a heatmap of our DB regions. First we calculate the \(\log_2\)-CPM for all regions within each sample using the cpm function. When applying this function, be sure to specify log=TRUE. Store the results in a variable called logCPM.

logCPM <- cpm(DGE, log = TRUE)

Next subset the logCPM matrix so that only significant regions - those with an FDR<0.05; are included.

logCPM_sig_res <- logCPM[row.names(logCPM) %in% row.names(DT[DT != 0, , drop = FALSE]), 
    ]

Now apply the coolmap function to the subset matrix, saving the results in the variable HM.

HM <- coolmap(logCPM_sig_res)

For this heatmap the region \(\log_2\)-CPM has been scaled across the rows to standardize the region coloring. Those regions with dark red are highly expressed, those with dark blue are lowly expressed, and those with white show very little change. While this heatmap is very clear, it is essentially impossible to reads the row names due to the high number of DB regions. We are able to create an interactive heatmap using the plotly package. The commands to re-render the coolmap heatmap are as follows:

HMdata <- as.data.frame(HM$carpet)
plot_ly(x = rownames(HMdata), y = colnames(HMdata), z = t(HMdata), type = "heatmap", 
    colors = colorRamp(c("blue", "white", "red"))) %>% layout(yaxis = list(showticklabels = FALSE, 
    ticklen = 0))

By moving your mouse over the heatmap you can see the sample, region ID and value of each cell.

This brings you to the end of the ChIP-seq workshop! We hope you have enjoyed the workshop and found it helpful. Please be sure to collect a fully work version of this document before you leave.